SUMMARY

To evaluate the calibration of a disease risk prediction tool, the quantity E/O, i.e., the ratio of the
expected number of events to the observed number of events, is generally computed. However, because
of censoring, or more precisely because of individuals who drop out before the end of the study,
this quantity is generally unavailable for the complete study population and an alternative estimate
has to be computed. In this paper, we present and compare four methods to do this. We show that two
of the most commonly used methods generally lead to biased estimates. Our arguments are first based
on theoretical considerations. Then, we perform a simulation study to highlight the magnitude of
these biases. As a concluding example, we evaluate the calibration of an existing
predictive model for breast cancer on the E3N-EPIC cohort.

1. INTRODUCTION



Researchers, physicians and the general public are focusing increasingly on statistical
models designed to predict the occurrence of a disease. The first such model, the
Framingham Coronary Risk Prediction Model published in 1976 [13], was aimed at predicting
an individual's risk of developing heart disease. Modified versions of this model are
now widely used by physicians to make decisions on prevention and treatment strategies. From
the late 1980s, researchers published prediction models for the absolute risk of breast cancer
[2]; [S]; [E], and prediction tools dealing with other types of cancer have begun to appear
in the literature over recent years [17]. In a workshop held in 2005, Freedman et al. [7]
already pointed out the growth in the number of cancer risk prediction tools and the
need to ensure that they are rigorously evaluated.

Two main criteria, discrimination and calibration, are usually retained for evaluation.
Other criteria may be retained for particular purposes; see [9] for some relevant examples.
Discrimination measures the ability to separate individuals into two groups, those who will
develop the disease and those who will not. It is often evaluated by the concordance statistic,
which is also the area under the receiver operating characteristic (ROC) curve. Calibration,
our concern here, measures the ability to predict the number of events in the population
of interest $\mathbb{S}$, usually over a $t_0$-year period, $t_0 > 0$: it measures the goodness-of-fit of the
model. Calibration is commonly evaluated by comparing the observed number of events with
the number of events expected to occur within the $t_0$-year period [14], [15]. By summing the
estimated $t_0$-year risks over all individuals belonging to a given representative sample $\mathbb{S}_n$ of
the population $\mathbb{S}$, we get the expected number of cases $E$. Denoting by $O$ the number
of cases observed in $\mathbb{S}_n$ over the $t_0$-year period, the $E/O$ ratio provides an estimator of the
theoretical quantity $\mathcal{E}/\mathcal{O}$ that would be obtained by evaluating the model on the
whole population $\mathbb{S}$, assumed to be infinite (in this asymptotic setting, $\mathcal{E}$ and $\mathcal{O}$ would stand
for rates rather than numbers). A model that is well calibrated on $\mathbb{S}$ would have a theoretical
$\mathcal{E}/\mathcal{O}$ equal to 1. Thus, the $E/O$ ratio is usually compared statistically with one to formally
assess the calibration of the model.

However, data are censored in most epidemiologic studies, either for administrative reasons
(the patient was followed until the end of the study but had not developed the disease by that
date) or because of dropping out (including both "pure" loss to follow-up and death from causes
other than the disease considered). This implies that the $t_0$-year disease status is unknown for
some individuals, and that the only available information for these individuals is that they had
not developed the disease after $z$ years of follow-up, with $0 < z < t_0$.
In other words, the number of cases that would have occurred in the sample $\mathbb{S}_n$ over
the $t_0$-year period is unknown, because of the individuals who dropped out before $t_0$ years of
follow-up. To get round this issue, various methods have been proposed and applied to provide
alternative estimators to the unobserved $E/O$ ratio. However, as will be shown later, most of
these methods generally lead to biased estimates. In Section 2 we provide the
derivation of four methods. For each of them we explain its principle as well as its potential
inaccuracy from a theoretical point of view. The confidence intervals associated with each method
are also presented. Then, in Section 3, the four methods are compared
on simulated data. Finally, we compare these methods on a real sample, the E3N-EPIC cohort,
in which we evaluate one of the Nurses' Health Study based breast cancer prediction tools [16]
(see Section 4). These examples support our assertion that the two most commonly used methods
lead to potentially highly biased estimates.

2. METHODS 

2.1. Notations 

Some notations will be of particular interest to describe the various methods that have been
(or can be) used to evaluate calibration.

Let $Y$ be the random variable of interest (in most cases, $Y$ will stand for the delay between
inclusion in the study and the occurrence of the disease considered), and $C$ the censoring
variable. The observed variables will be denoted by $Z = \min(Y, C)$ and $\delta = \mathbb{1}\{Y < C\}$,
where $\mathbb{1}_{\mathcal{S}}$ equals 1 if the condition $\mathcal{S}$ is true and 0 otherwise (i.e., here, $\delta$ equals 1 if $Y < C$,
and 0 otherwise). We fix $t_0 > 0$ and consider the evaluation of a $t_0$-year risk prediction tool
$P_{t_0}$ on a given population $\mathbb{S}$. Assume a representative sample $\mathbb{S}_n = \{1, \ldots, n\} \subset \mathbb{S}$, $n > 1$, is
at our disposal. In this setting, our aim is to estimate the theoretical $\mathcal{E}/\mathcal{O}$ ratio (relative to
$P_{t_0}$ on $\mathbb{S}$) on the sample $\mathbb{S}_n$. For every individual $i \in \mathbb{S}_n$, denote by $z_i$ the observed
follow-up time, and by $e_i = e_i(t_0)$ the expected risk according to $P_{t_0}$. Throughout, we will assume that
$t_0 < \max_{i \in \mathbb{S}_n} z_i$. Further introduce the random variable $O_i = \mathbb{1}\{Y_i < t_0\}$ (i.e., $O_i$ equals one
if $Y_i < t_0$ and 0 otherwise), and write $o_i$ for the realisation of $O_i$, $i = 1, \ldots, n$.

We will denote by $\mathbb{S}_{n,ks}$ the group consisting of individuals for whom the status regarding
the disease after $t_0$ years of follow-up is known. They will be referred to hereafter as 'known
$t_0$-status individuals'. This group consists of

1. individuals who developed the disease before $t_0$ years of follow-up ($y_i < c_i$, $z_i < t_0$ and
$o_i = 1$ for these individuals);

2. individuals who developed the disease after $t_0$ years of follow-up ($y_i < c_i$, $z_i \ge t_0$ and
$o_i = 0$ for these individuals);

3. individuals who did not develop the disease and were followed up for at least $t_0$ years ($c_i < y_i$,
$z_i \ge t_0$ and $o_i = 0$ for these individuals).

Here and elsewhere, $y_i$ [resp. $c_i$] stands for the realisation of the random variable $Y_i$ [resp. $C_i$].

Similarly, we will denote by $\mathbb{S}_{n,uks}$ the group consisting of individuals for whom the status is
unknown: for these 'unknown $t_0$-status individuals', $c_i < y_i$ and $z_i < t_0$, so $O_i$ is unobserved and
$o_i$ is unknown.
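As an illustration of this partition, a minimal Python sketch is given below. It assumes that each individual is described by the observed follow-up time `z` and the event indicator `delta`; the function name `t0_status`, the toy data and the convention of returning `None` for an unknown status are illustrative choices, not part of the original notation.

```python
def t0_status(z, delta, t0):
    """Return o_i (1 or 0) for known t0-status individuals, None otherwise.

    z     : observed follow-up times, z_i = min(y_i, c_i)
    delta : event indicators, 1 if the disease was observed (y_i < c_i), else 0
    """
    status = []
    for zi, di in zip(z, delta):
        if di == 1 and zi < t0:
            status.append(1)      # group 1: disease before t0
        elif zi >= t0:
            status.append(0)      # groups 2 and 3: disease-free at t0
        else:
            status.append(None)   # censored before t0: status unknown
    return status


# Hypothetical toy data with t0 = 10 years.
z = [3.0, 12.0, 10.0, 4.5]
delta = [1, 1, 0, 0]
status = t0_status(z, delta, t0=10.0)
n_ks = sum(s is not None for s in status)
n_uks = len(status) - n_ks
```

In this toy sample the last individual, censored after 4.5 years, is the only one with an unknown 10-year status.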

Note that the rate of unknown $t_0$-status individuals increases as $t_0$ increases. Therefore, the
size of $\mathbb{S}_{n,uks}$ relative to that of $\mathbb{S}_{n,ks}$ increases as $t_0$ increases. In addition, $\mathbb{S}_{n,ks}$ as well as
$\mathbb{S}_{n,uks}$ are unrepresentative of the whole sample $\mathbb{S}_n$ since, in general,

\[ P(Y_i < t_0 \mid i \in \mathbb{S}_{n,ks}) \ne P(Y_i < t_0 \mid i \in \mathbb{S}_n) \quad \text{and} \quad P(Y_i < t_0 \mid i \in \mathbb{S}_{n,uks}) \ne P(Y_i < t_0 \mid i \in \mathbb{S}_n). \tag{1} \]

More precisely, one can see that $\mathbb{S}_{n,ks}$ overrepresents cases with respect to $\mathbb{S}_n$, i.e.,

\[ P(Y_i < t_0 \mid i \in \mathbb{S}_{n,ks}) > P(Y_i < t_0 \mid i \in \mathbb{S}_n). \tag{2} \]

This is the main reason why, to derive inference on $\mathbb{S}$ in the presence of censoring, dedicated tools
(e.g., the Kaplan-Meier estimate when estimating the unconditional probability of developing
the disease) are required to ensure unbiased estimates.

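Before presenting the four methods aimed at evaluating the calibration of $P_{t_0}$ on $\mathbb{S}$, some
additional notations are needed. Denote by $n_{ks}$ [resp. $n_{uks}$] the number of individuals belonging
to $\mathbb{S}_{n,ks}$ [resp. $\mathbb{S}_{n,uks}$]. Obviously, we have $n = n_{ks} + n_{uks}$ (since $\mathbb{S}_{n,ks} \cup \mathbb{S}_{n,uks} = \mathbb{S}_n$ and
$\mathbb{S}_{n,ks} \cap \mathbb{S}_{n,uks} = \emptyset$). The following quantities will be of particular interest in the sequel. Introduce

\[ E_{\mathbb{S}_n} = \sum_{i \in \mathbb{S}_n} e_i, \qquad E_{\mathbb{S}_{n,ks}} = \sum_{i \in \mathbb{S}_{n,ks}} e_i, \qquad E_{\mathbb{S}_{n,uks}} = \sum_{i \in \mathbb{S}_{n,uks}} e_i, \]

\[ O_{\mathbb{S}_n} = \sum_{i \in \mathbb{S}_n} O_i, \qquad O_{\mathbb{S}_{n,ks}} = \sum_{i \in \mathbb{S}_{n,ks}} O_i, \qquad O_{\mathbb{S}_{n,uks}} = \sum_{i \in \mathbb{S}_{n,uks}} O_i. \]

Note that the "O" terms are random and possibly unobserved (in particular, $O_{\mathbb{S}_n}$ and $O_{\mathbb{S}_{n,uks}}$
are unobserved) whereas the "E" terms are non-random and known. This is classical in
evaluation studies where inference is made given the sample, which ensures that $e_i$ is
non-random for $i = 1, \ldots, n$.

Since $O_{\mathbb{S}_n}$ is unobserved, $E_{\mathbb{S}_n}/O_{\mathbb{S}_n}$ cannot be used to estimate the theoretical $\mathcal{E}/\mathcal{O}$ ratio. We
present four methods to get round this issue in the following paragraphs.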

2.2. Method M0

In some validation studies [14], [15], the evaluation of the calibration is restricted to $\mathbb{S}_{n,ks}$, and
the quantity $\mathcal{E}/\mathcal{O}$ is estimated by

\[ \mathcal{R}_{0,n} = E_{\mathbb{S}_{n,ks}} / O_{\mathbb{S}_{n,ks}}. \tag{3} \]

However, in view of (1)-(2), if a model has to be evaluated on $\mathbb{S}_n$, great care is needed
when the validation is performed on $\mathbb{S}_{n,ks}$: even if the score is well calibrated on $\mathbb{S}$ (and then
on $\mathbb{S}_n$), the expectation of the ratio $E_{\mathbb{S}_{n,ks}}/O_{\mathbb{S}_{n,ks}}$ does not equal 1. In fact, it can even be
shown that this expectation is less than 1, since the known $t_0$-status group $\mathbb{S}_{n,ks}$ overrepresents
cases with respect to $\mathbb{S}_n$ and $\mathbb{S}$ (see (2) above and (11) below).
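The restriction underlying M0 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ratio_m0` is a hypothetical helper name, `e_t0` holds the $t_0$-year risks $e_i(t_0)$, and the toy sample is invented.

```python
def ratio_m0(e_t0, z, delta, t0):
    """E/O ratio restricted to known t0-status individuals (method M0)."""
    expected, observed = 0.0, 0
    for ei, zi, di in zip(e_t0, z, delta):
        is_case = di == 1 and zi < t0
        if is_case or zi >= t0:        # known t0-status only
            expected += ei
            observed += int(is_case)
    return expected / observed


# Hypothetical sample: 10 individuals, each with a 10-year risk of 0.2.
e_t0 = [0.2] * 10
z = [2.0, 7.0] + [10.0] * 6 + [3.0, 5.0]
delta = [1, 1] + [0] * 6 + [0, 0]
r0 = ratio_m0(e_t0, z, delta, t0=10.0)   # expected cases 8 * 0.2, observed 2
```

Discarding the two individuals censored before $t_0$ removes their expected contribution from the numerator, while every case observed before $t_0$ stays in the denominator; this is the mechanism that pulls the ratio below one.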
2.3. Method M1

Another method, evaluating the calibration on the whole sample $\mathbb{S}_n$, can be found in
the literature (see, for instance, [S]). The underlying idea is that although $O_{\mathbb{S}_n}$ is not at our
disposal, $O_{1,\mathbb{S}_n} = \sum_{i \in \mathbb{S}_n} \mathbb{1}\{y_i \le \min(z_i, t_0)\}$ is. Then, setting

\[ E_{1,\mathbb{S}_n} = \sum_{i \in \mathbb{S}_n} e_i(\min(t_0, z_i)), \tag{4} \]

the estimate of $\mathcal{E}/\mathcal{O}$ is computed as follows:

\[ \mathcal{R}_{1,n} = E_{1,\mathbb{S}_n} / O_{1,\mathbb{S}_n} = E_{1,\mathbb{S}_n} / O_{\mathbb{S}_{n,ks}}. \tag{5} \]

Note that $e_i(\min(t_0, z_i)) = e_i(t_0)$ only for individuals $i$ who were still disease-free after $t_0$
years. For all other individuals (i.e., the individuals belonging to $\mathbb{S}_{n,uks}$, plus the individuals
belonging to $\mathbb{S}_{n,ks}$ for whom $o_i = 1$), we have $e_i(\min(t_0, z_i)) = e_i(z_i) < e_i(t_0)$.

To see why this method is inappropriate, we present a simple example. Assume a database
of 10,000 individuals followed over a 5-year period (with no dropping-out) is at our disposal.
Further suppose that the risk of the disease considered is uniform over the 5-year period,
such that $P(Y < t) = t/100$, for all $t \le 5$. Then about 100 cases are likely to be observed
each year. Assume exactly 100 cases are observed each year (giving 500 cases overall) and
that the prediction tool under evaluation is $P_t = t/100$, for all $t \le 5$. Each of the 9,500
individuals who remained free from disease after the 5-year period contributes 5% to the
calculation of $E_{1,\mathbb{S}_n}$. On the other hand, each individual who developed the disease within the
first year of follow-up contributes (at most) 1%, each of those who developed the disease within the
second year of follow-up 2%, and so on. Therefore, according to $M_1$, the number of expected
cases is (at most)

\[ 100 \times 1\% + 100 \times 2\% + 100 \times 3\% + 100 \times 4\% + 100 \times 5\% + 9{,}500 \times 5\% = 490, \]

so that $\mathcal{R}_{1,n} = 0.98$, even though the tool is perfectly calibrated by construction. Obviously, the
bias is more severe when the disease prevalence is high.
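The arithmetic of this example can be checked with a short Python sketch. The variable names are illustrative; as in the text, each case is credited with the risk up to the end of the year in which it occurred, which is the upper bound used above.

```python
def tool_risk(t):
    """Prediction tool of the example: P_t = t / 100 for t <= 5 (in years)."""
    return t / 100.0


n_total = 10_000
cases_per_year = {1: 100, 2: 100, 3: 100, 4: 100, 5: 100}
n_cases = sum(cases_per_year.values())

# M1 credits a case in year k with (at most) the k-year risk,
# and each disease-free individual with the full 5-year risk.
expected_m1 = sum(count * tool_risk(year) for year, count in cases_per_year.items())
expected_m1 += (n_total - n_cases) * tool_risk(5)

ratio_m1 = expected_m1 / n_cases
ratio_full = n_total * tool_risk(5) / n_cases   # every individual at the 5-year risk
```

Crediting every individual with the full 5-year risk (`ratio_full`) recovers an E/O ratio of one, whereas the truncation used by M1 loses 10 expected cases.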
2.4. Method M2

An easy way to correct the aforementioned bias pertaining to $M_1$ exists. In fact, $O_{\mathbb{S}_{n,ks}}$ is known
and only $O_{\mathbb{S}_{n,uks}}$ is unknown. Following the idea of $M_1$, however,
$O_{1,\mathbb{S}_{n,uks}} = \sum_{i \in \mathbb{S}_{n,uks}} \mathbb{1}\{y_i \le \min(z_i, t_0)\}$ $(= 0)$ is known. Thus,
$O_{2,\mathbb{S}_n} = O_{\mathbb{S}_{n,ks}} + O_{1,\mathbb{S}_{n,uks}}$ is known.

Therefore, setting

\[ E_{2,\mathbb{S}_n} = \sum_{i \in \mathbb{S}_{n,ks}} e_i(t_0) + \sum_{i \in \mathbb{S}_{n,uks}} e_i(z_i), \tag{6} \]

a new estimate of $\mathcal{E}/\mathcal{O}$ is given by

\[ \mathcal{R}_{2,n} = E_{2,\mathbb{S}_n} / O_{2,\mathbb{S}_n} = E_{2,\mathbb{S}_n} / O_{\mathbb{S}_{n,ks}}. \tag{7} \]

Concretely, in $E_{2,\mathbb{S}_n}$ the individuals who developed the disease before $t_0$ years of follow-up
contribute $e_i(t_0)$, while they contribute $e_i(z_i)$ in $E_{1,\mathbb{S}_n}$ (keep in mind that $e_i(z_i) < e_i(t_0)$
because $z_i < t_0$ for such individuals). Comparing the estimates provided by $M_1$ and $M_2$, it is
easily derived that

\[ \mathcal{R}_{1,n} \le \mathcal{R}_{2,n}. \tag{8} \]

Note that, to our knowledge, $M_2$ has never been used so far, although it provides a simple and
practical way to improve $M_1$. However, it is not clear whether $\mathcal{R}_{2,n}$ is unbiased or not: the
explicit expression of the expectation of $\mathcal{R}_{2,n}$ (or $1/\mathcal{R}_{2,n}$) cannot be easily derived. Moreover,
there is a drawback common to both $M_1$ and $M_2$: with either method, a crude $t_0$-year risk
score cannot be evaluated. In fact, some of the $e_i$'s involved in the
calculation of $E_{1,\mathbb{S}_n}$ and $E_{2,\mathbb{S}_n}$ are attached to a $t_0$-year period, while others are attached
to a $z_i$-year period. The main problem arises when calibration is adjusted for percentiles of
predicted risk (which is quite common in evaluation studies), and is due to the fact that the
$e_i$'s are not comparable. In this adjusted setting, the estimation of the distribution of the $e_i$'s, and
then the derivation of their percentiles, becomes hazardous. Similar problems also arise when
calibration is adjusted for risk factors (such as age at inclusion or personal history of the
disease). Therefore, $M_2$ should not be used when adjusted calibration has to be evaluated.
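To make the difference between M1 and M2 concrete, the following Python sketch computes both ratios from the same data. It assumes the prediction tool can be called for any horizon through a function `risk(i, t)`; that interface, the helper name `ratios_m1_m2` and the toy data are assumptions made for illustration.

```python
def ratios_m1_m2(risk, z, delta, t0):
    """Return (R1, R2); risk(i, t) is the model's t-year risk for individual i."""
    e1 = e2 = 0.0
    observed = 0
    for i, (zi, di) in enumerate(zip(z, delta)):
        horizon = min(zi, t0)
        is_case = di == 1 and zi < t0
        observed += int(is_case)
        e1 += risk(i, horizon)                               # M1: risk up to min(z_i, t0)
        e2 += risk(i, t0) if is_case else risk(i, horizon)   # M2: cases get the t0-year risk
    return e1 / observed, e2 / observed


def uniform_risk(i, t):
    """Hypothetical tool: the same risk t / 100 for every individual."""
    return t / 100.0


# Two cases (at 1 and 3 years) and eight individuals disease-free at 5 years.
z = [1.0, 3.0] + [5.0] * 8
delta = [1, 1] + [0] * 8
r1, r2 = ratios_m1_m2(uniform_risk, z, delta, t0=5.0)
```

Both ratios share the same denominator, so the gap between them comes only from the risk credited to the observed cases.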
2.5. Method M3

Keep in mind that in the absence of dropping-out, the quantity $E_{\mathbb{S}_n}/O_{\mathbb{S}_n}$ provides a suitable
estimate of $\mathcal{E}/\mathcal{O}$. The problem in the presence of dropping-out arises from the fact that $O_{\mathbb{S}_n}$
is unknown. However, a natural candidate to replace $O_{\mathbb{S}_n}$ is defined as follows:

\[ \hat{O}_{\mathbb{S}_n} = n K_n(t_0), \tag{9} \]

where $K_n(t_0)$ is the Kaplan-Meier estimate of $P(Y < t_0)$ on $\mathbb{S}_n$. Using (9), an estimate of
$\mathcal{E}/\mathcal{O}$ can be given by

\[ \mathcal{R}_{3,n} = E_{\mathbb{S}_n} / \hat{O}_{\mathbb{S}_n} = E_{\mathbb{S}_n} / \bigl(n K_n(t_0)\bigr). \tag{10} \]

Note that since $K_n(t_0) \to P(Y < t_0)$ almost surely as $n \to \infty$, for any $t_0 < \max_{i \in \mathbb{S}_n} z_i$ [18], it
is easily derived that $\mathcal{R}_{3,n}$ is asymptotically unbiased.
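A minimal, dependency-free Python sketch of this estimator is given below. The Kaplan-Meier routine is a plain product-limit implementation written for illustration (ties are grouped, and events at exactly $t_0$ are counted); the function names and the toy data are assumptions, and in practice an established survival analysis package would normally be preferred.

```python
def kaplan_meier_cdf(z, delta, t0):
    """Kaplan-Meier estimate of the probability of disease by time t0."""
    order = sorted(range(len(z)), key=lambda i: z[i])
    at_risk = len(z)
    survival = 1.0
    k = 0
    while k < len(order) and z[order[k]] <= t0:
        t = z[order[k]]
        events = leaving = 0
        # Group all individuals whose follow-up ends at the same time t.
        while k < len(order) and z[order[k]] == t:
            events += delta[order[k]]
            leaving += 1
            k += 1
        survival *= 1.0 - events / at_risk
        at_risk -= leaving
    return 1.0 - survival


def ratio_m3(e_t0, z, delta, t0):
    """E/O ratio with O replaced by n * K_n(t0) (method M3)."""
    return sum(e_t0) / (len(z) * kaplan_meier_cdf(z, delta, t0))


# Hypothetical toy data: the second individual is censored at time 2.
z = [1.0, 2.0, 3.0, 4.0]
delta = [1, 0, 1, 1]
km = kaplan_meier_cdf(z, delta, t0=3.5)
r3 = ratio_m3([0.5] * 4, z, delta, t0=3.5)
```

Unlike M0, every individual contributes to the numerator, and the censored individual still informs the denominator through the number at risk.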

Through this theoretical description of the various methods, we showed that $\mathcal{R}_{0,n}$ and $\mathcal{R}_{1,n}$
provide biased estimates. Moreover, asymptotic unbiasedness was established for $\mathcal{R}_{3,n}$ but
not for $\mathcal{R}_{2,n}$, suggesting that $\mathcal{R}_{3,n}$ is the most reliable estimate of the $\mathcal{E}/\mathcal{O}$ ratio. In addition, $\mathcal{R}_{3,n}$
is intuitively the most appealing estimator because it takes into account all the information
available after $t_0$ years of follow-up. These statements will be confirmed by the simulation
studies performed in Section 3.

2.6. Some complements

Some additional properties of the various methods merit presentation.

2.6.1. The inadmissibility of M0. Comparing $\mathcal{R}_{0,n}$ and $\mathcal{R}_{3,n}$ gives insight into the magnitude
of the bias pertaining to $M_0$. Under the assumption of independence between the vector of
covariates and the censoring variable, it can be shown that

\[ \mathcal{R}_{3,n} \simeq C_0(t_0)\, \mathcal{R}_{0,n}, \qquad C_0(t_0) = F_{n,ks}(t_0) / K_n(t_0), \tag{11} \]

where $F_{n,ks}(t_0)$ is the empirical distribution function on $\mathbb{S}_{n,ks}$, i.e., the standard estimate of the
probability of developing the disease on $\mathbb{S}_{n,ks}$. See the Appendix for the proof of (11).

Since $K_n(t_0) < F_{n,ks}(t_0)$ almost surely for $n$ large enough, we have $C_0(t_0) > 1$. For instance,
set $Z = \max_{i \in \mathbb{S}_n} z_i$, and select $t_0 = Z$. In this particular example, all the cases belong to
$\mathbb{S}_{n,ks}$; non-cases do not. Thus, $F_{n,ks}(Z) = 1$, while $K_n(Z) \ll 1$ (typically, $K_n(Z)$ does not
exceed 0.2), and $C_0(Z)$ becomes high. Even if this case is somewhat extreme, it highlights the
inadmissibility of $M_0$, which is nevertheless among the most widely used methods.



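2.6.2. Confidence intervals. In order to conclude whether a given prediction tool is well
calibrated on $\mathbb{S}$ or not, confidence intervals are generally needed. When the estimation of
$\mathcal{E}/\mathcal{O}$ is based on $M_0$, $M_1$ or $M_2$, such intervals can be calculated using the Poisson variance
for the logarithm of the observed number of cases [15]. Namely, for $j = 0, 1, 2$,

\[ \mathrm{CI}_{j,n,95\%}(\mathcal{E}/\mathcal{O}) = \mathcal{R}_{j,n} \exp\Bigl( \pm 1.96 \sqrt{1/O_j} \Bigr), \tag{12} \]

where $O_j$ is the observed number of cases entering $\mathcal{R}_{j,n}$ (namely $O_{\mathbb{S}_{n,ks}}$, in view of (3), (5) and (7)).

Note that, since $M_0$ and $M_1$ lead to biased estimates of the quantity $\mathcal{E}/\mathcal{O}$, the above formula
may only be correct for $j = 2$ (if $\mathcal{R}_{2,n}$ eventually turns out to be unbiased).
On the other hand, in the case of $M_3$, a log-transformation can be coupled with the
delta-method, giving

\[ \mathrm{Var}\bigl(\log \mathcal{R}_{3,n}\bigr) \simeq \frac{\sigma^2_{n,t_0}}{K_n(t_0)^2}, \]

where $\sigma^2_{n,t_0}$ is the Greenwood variance [12] of the Kaplan-Meier estimate evaluated at $t_0$. The
corresponding confidence interval is given by

\[ \mathrm{CI}_{3,n,95\%}(\mathcal{E}/\mathcal{O}) = \mathcal{R}_{3,n} \exp\Bigl( \pm 1.96\, \frac{\sigma_{n,t_0}}{K_n(t_0)} \Bigr). \tag{13} \]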
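The two interval formulas can be sketched in Python as follows. The functions take the point estimate and the variance ingredients as inputs; their names, and the numerical inputs of the usage example, are illustrative assumptions rather than quantities taken from a real data set.

```python
import math


def ci_poisson(ratio, observed_cases, z95=1.96):
    """Interval (12): Poisson variance 1/O for log(O); intended for M0, M1, M2."""
    half = z95 * math.sqrt(1.0 / observed_cases)
    return ratio * math.exp(-half), ratio * math.exp(half)


def ci_kaplan_meier(ratio, km, greenwood_var, z95=1.96):
    """Interval (13): delta method on log K_n(t0) with the Greenwood variance."""
    half = z95 * math.sqrt(greenwood_var) / km
    return ratio * math.exp(-half), ratio * math.exp(half)


# Hypothetical inputs: ratio 1, 2,000 observed cases among n = 20,000.
low12, high12 = ci_poisson(1.0, 2000)
# Without censoring, Greenwood's formula reduces to the binomial variance p(1-p)/n.
p, n = 0.1, 20_000
low13, high13 = ci_kaplan_meier(1.0, p, p * (1 - p) / n)
```

With these inputs the widths are about 0.088 for (12) and 0.083 for (13). The interval based on the Kaplan-Meier variance is narrower because it uses the binomial-type variance, which includes the factor $1 - p$, rather than the Poisson one.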



3. SIMULATION STUDY 

A simulation study was performed to check that $\mathcal{R}_{2,n}$ and $\mathcal{R}_{3,n}$ were better estimates of the
$\mathcal{E}/\mathcal{O}$ ratio than $\mathcal{R}_{1,n}$ and $\mathcal{R}_{0,n}$, and to compare $\mathcal{R}_{2,n}$ and $\mathcal{R}_{3,n}$.

We considered the case where $Y \sim \mathcal{U}(0, \lambda)$, for a given $\lambda > t_0$, i.e., $Y$ was uniformly distributed
on the interval $[0, \lambda]$. This ensured that $P(Y < t) = t/\lambda$, for all $0 \le t \le \lambda$. Note that the higher
the rate $1/\lambda$, the higher the "prevalence of the disease", and therefore the higher the bias of
$\mathcal{R}_{1,n}$ is expected to be (see Section 2.3). For the censoring variable, we chose $C \sim \mathcal{U}(0, \omega_\lambda)$,
for a given $\omega_\lambda > 0$. To allow the rate of unknown $t_0$-status individuals to vary, we selected
various values of $\omega_\lambda$, depending on the rate $1/\lambda$. We also considered the case with no censoring
(and therefore with no unknown $t_0$-status individuals) to check that $M_0$, $M_2$ and $M_3$ provided
the same estimates in this case.

Given $n > 1$, samples $(Y_1, \ldots, Y_n)$ and, if appropriate, $(C_1, \ldots, C_n)$ were simulated. From
these samples, we generated the "observed" sample $((Z_1, \delta_1), \ldots, (Z_n, \delta_n))$, where, as usual,
$Z_i = \min(Y_i, C_i)$ and $\delta_i = \mathbb{1}\{Y_i < C_i\}$. The sample $\mathbb{S}_n = \{1, \ldots, n\}$ could then be split into
$\mathbb{S}_{n,ks}$ and $\mathbb{S}_{n,uks}$, making the calculation of $O_{\mathbb{S}_{n,ks}}$ (and hence of $O_{1,\mathbb{S}_n}$ and $O_{2,\mathbb{S}_n}$) possible.
Moreover, the Kaplan-Meier estimate could be calculated on our samples, enabling us to compute
$\hat{O}_{\mathbb{S}_n}$. Finally, the terms $E_{\mathbb{S}_n}$, $E_{\mathbb{S}_{n,ks}}$, $E_{1,\mathbb{S}_n}$ and $E_{2,\mathbb{S}_n}$, and then $\mathcal{R}_{0,n}$, $\mathcal{R}_{1,n}$, $\mathcal{R}_{2,n}$ and $\mathcal{R}_{3,n}$, were computed
using the formulas $e_i(t_0) = t_0/\lambda$ and $e_i(z_i) = z_i/\lambda$, and the corresponding confidence intervals
were constructed using equations (12) and (13). Note that, given the way the expected
number of cases was calculated, the underlying prediction tool was well calibrated by construction, so
$\mathcal{R}_{j,n}$ should be close to one if the method $M_j$, $j = 0, 1, 2, 3$, provided unbiased estimates of
the $\mathcal{E}/\mathcal{O}$ ratio.

In every example, we selected $n = 20{,}000$ and $t_0 = 10$. We repeated the procedure described
above 1,000 times, computing (i) the mean of each of the $\mathcal{R}_{j,n}$, $j = 0, 1, 2, 3$, (ii) the mean
width of the corresponding confidence interval and (iii) the proportion of confidence intervals
including the value 1 (which is an estimate of the coverage probability of the confidence
interval).
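One replicate of this design can be sketched in Python as follows. The closed-form choice of the censoring bound in `omega_for_rate` is our own derivation for uniform $Y$ and $C$ (it is not given in the text and assumes the bound is at least $t_0$), and the seed and function names are arbitrary.

```python
import random


def omega_for_rate(rate, t0, lam):
    """Censoring bound giving the target rate of unknown t0-status individuals.

    Derived here (not in the paper) from
    P(C < min(Y, t0)) = t0 * (1 - t0 / (2 * lam)) / omega, for omega >= t0 and lam >= t0.
    """
    return t0 * (1.0 - t0 / (2.0 * lam)) / rate


def simulate(n, lam, omega, rng):
    """Draw (z_i, delta_i) with Y ~ U(0, lam) and C ~ U(0, omega)."""
    z, delta = [], []
    for _ in range(n):
        y, c = rng.uniform(0.0, lam), rng.uniform(0.0, omega)
        z.append(min(y, c))
        delta.append(1 if y < c else 0)
    return z, delta


rng = random.Random(12345)          # arbitrary seed, for reproducibility only
n, t0, lam = 20_000, 10.0, 100.0
omega = omega_for_rate(0.10, t0, lam)
z, delta = simulate(n, lam, omega, rng)

n_unknown = sum(1 for zi, di in zip(z, delta) if di == 0 and zi < t0)
n_cases = sum(1 for zi, di in zip(z, delta) if di == 1 and zi < t0)
unknown_rate = n_unknown / n
# M0 on this replicate, with e_i(t0) = t0 / lam for every individual.
r0 = (n - n_unknown) * (t0 / lam) / n_cases
```

Under these settings the expected rate of unknown $t_0$-status individuals is 10%, and the expected value of the M0 ratio works out to roughly 0.95, so a single replicate should land near these figures up to sampling noise.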

We selected $\lambda = 100$, $\lambda = 200$ and $\lambda = 400$, and in each case, we selected three values of $\omega_\lambda$
such that there were 5%, 10% and 20% of unknown 10-year status individuals (plus the case
with no censoring at all): this resulted in $3 \times 4 = 12$ simulation designs. The results are presented
in Table I.

First consider the mean of the point estimates obtained for each method in each simulation
design. We observed that the estimates $\mathcal{R}_{0,n}$, $\mathcal{R}_{2,n}$ and $\mathcal{R}_{3,n}$ were identical in the uncensored
cases, corresponding to the cases where the rate of unknown $t_0$-status individuals was null. In
addition, we observed that $\mathcal{R}_{1,n} < 1$ in every case, and that the bias magnitude depended upon
the "prevalence" $1/\lambda$, independently of the rate of unknown $t_0$-status individuals. Finally, the
error made when using $\mathcal{R}_{0,n}$ increased with this rate (as expected again).
All these observations confirmed the assertions presented in Section 2. The correction terms
presented in Table II made these observations even clearer and supported the approximation
stated in (11). Furthermore, the estimates $\mathcal{R}_{2,n}$ and $\mathcal{R}_{3,n}$ gave very similar values, which
were close to the true value 1. These first results confirmed that $\mathcal{R}_{2,n}$ and $\mathcal{R}_{3,n}$ were
better estimates of the $\mathcal{E}/\mathcal{O}$ ratio than $\mathcal{R}_{0,n}$ and $\mathcal{R}_{1,n}$, and hence that the use of the latter two
estimators should be avoided.

Considering $\mathcal{R}_{2,n}$ and $\mathcal{R}_{3,n}$ in more detail, we saw that the means of the $\mathcal{R}_{3,n}$'s were slightly
closer to 1 than those of the $\mathcal{R}_{2,n}$'s. Moreover, comparing the width and the coverage
probability of the corresponding confidence intervals, $\mathcal{R}_{3,n}$ appeared to be more precise than
$\mathcal{R}_{2,n}$, with narrower but still accurate confidence intervals. Therefore, from this simple
simulation study, the estimate $\mathcal{R}_{3,n}$ turned out to be the most advisable one.

Note that the precision of $\mathcal{R}_{3,n}$ (as well as that of $\mathcal{R}_{2,n}$) was closely related to the prevalence
$1/\lambda$: the higher the prevalence, the more precise the estimates.

4. CASE STUDY: THE EVALUATION OF AN EXISTING BREAST CANCER
PREDICTION TOOL ON THE E3N COHORT

E3N (Etude Epidemiologique des femmes de l'Education Nationale) is the French component
of the EPIC (European Prospective Investigation into Cancer and Nutrition) prospective
study and has been thoroughly described elsewhere [6]. All participants are women belonging
to the Mutuelle Generale de l'Education Nationale (MGEN), a health insurance scheme
primarily covering teachers, teachers' spouses, and employees of the National Education
System. Since June 1990, after having given informed consent, 98,995 women have been
asked at approximately 24-month intervals to complete self-administered questionnaires, which
cover a variety of lifestyle characteristics. After the exclusion of prevalent cases of cancer
(n = 6,999) and women who had never menstruated (n = 28), the cohort includes 91,968
women (with 3,467 cases of invasive breast cancer).

Rosner and Colditz proposed two breast cancer risk prediction models according to
which the incidence of breast cancer at age $a$ ($I_a$) is proportional to the number of breast cell
divisions accumulated throughout life up to age $a$ [16]. The rate of breast cell
division at age $a'$ is supposed to depend on the risk factors that are relevant at age $a'$.
Rosner and Colditz thus expressed the log incidence rate of breast cancer as a linear function
of the cumulative effect of individual breast cancer risk factors. In a first attempt [16], in
addition to age ($a$), only reproductive factors were considered, namely, age at menarche ($a_0$),
menopausal status ($m$), age at menopause ($a_m$), parity ($s$), age at first birth ($a_1$), and a variable
$b$, called the birth index and defined as $b = \sum_{i=1}^{s} (a^* - a_i)\, b_{i,a}$, where $a_i$ is the age at the $i$th birth,
$a^* = \min(a, a_m)$ and $b_{i,a} = 1$ if parity is greater than $i$ at age $a$, 0 otherwise. Defining $b_1$ as 1
if $s \ge 1$, 0 otherwise, the RCM was specified as

\[ \log I_a = \alpha + \beta_0 a_0 + \beta_1 (a^* - a_0) + \beta_2 (a - a_m)\, m + \beta_3 (a_1 - a_0)\, b_1 + \beta_4 b + \beta_5 b\, (a - a_m)\, m. \tag{14} \]

The values of the parameters $\alpha, \beta_0, \ldots, \beta_5$ estimated in [16] are recalled for convenience in
Table III.
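The structure of (14) can be sketched in Python as follows. This is an illustrative transcription only: the function signature is an assumption, the birth index $b$ is taken as a precomputed input, and the coefficient values in the example are hypothetical placeholders rather than the published estimates.

```python
def rcm_log_incidence(coef, age, age_menarche, age_menopause=None,
                      age_first_birth=None, birth_index=0.0):
    """Log incidence rate at a given age, following the structure of (14).

    coef            : dict with keys 'alpha', 'beta0', ..., 'beta5'
    age_menopause   : None if unknown or not yet reached
    age_first_birth : None for a nulliparous woman
    birth_index     : the variable b, computed beforehand
    """
    m = 1 if (age_menopause is not None and age >= age_menopause) else 0
    a_star = age_menopause if m else age            # a* = min(a, a_m)
    years_since_menopause = (age - age_menopause) if m else 0.0
    b1 = 1 if age_first_birth is not None else 0
    first_birth_term = (age_first_birth - age_menarche) if b1 else 0.0
    return (coef["alpha"]
            + coef["beta0"] * age_menarche
            + coef["beta1"] * (a_star - age_menarche)
            + coef["beta2"] * years_since_menopause * m
            + coef["beta3"] * first_birth_term * b1
            + coef["beta4"] * birth_index
            + coef["beta5"] * birth_index * years_since_menopause * m)


# Hypothetical coefficients, chosen for illustration only (not the published ones).
coef = {"alpha": -9.0, "beta0": 0.05, "beta1": 0.02, "beta2": 0.01,
        "beta3": 0.01, "beta4": -0.004, "beta5": -0.0002}
# Woman aged 55, menarche at 13, menopause at 50, births at 25 and 30:
# birth index b = (50 - 25) + (50 - 30) = 45.
log_rate = rcm_log_incidence(coef, 55, 13, age_menopause=50,
                             age_first_birth=25, birth_index=45)
```

Keeping the coefficients in a dictionary makes it straightforward to substitute the published estimates once they have been checked against the original reference.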

Rosner and Colditz later developed a model including more factors [2]; an evaluation study
of the two versions can be found in Rockhill et al. [14]. We chose to evaluate the first version
(RCM) as some of the variables involved in the extended one were not available in the E3N
study (BMI at menarche, for instance). Moreover, our aim was to measure the respective
performances of the methods presented in Section 2 rather than to evaluate the best published
model.

To compute the $t$-year risk (where $t$ can take the value $t_0$ or $z_i$ depending on the method
used to evaluate the RCM), we proceeded as in Rockhill et al.'s evaluation study [15]: once we had
obtained, from (14), the log incidence rates for each year for each woman, we exponentiated
each one to get an incidence rate $r_j$, $j = 1, \ldots, t$, for each year of the $t$-year period; then,
the $t$-year risk was computed as $1 - \exp(-[r_1 + \cdots + r_t])$. We chose $t_0 = 10$ years. In addition,
we performed the evaluation on three groups:

• the whole sample (n = 91,968);

• Postmeno. Group 1, comprising women who were postmenopausal at inclusion (n = 36,603);

• Postmeno. Group 2, comprising Postmeno. Group 1 plus the women who went through the
menopause during the study (these women entered this group at the time of their
menopause) (n = 82,402).
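The $t$-year risk computation described before the list of groups, i.e. exponentiating the yearly log incidence rates and applying $1 - \exp(-[r_1 + \cdots + r_t])$, can be sketched in Python as follows. The function name and the constant yearly rate used in the example are illustrative assumptions.

```python
import math


def t_year_risk(yearly_log_rates):
    """t-year risk 1 - exp(-(r_1 + ... + r_t)) from yearly log incidence rates."""
    rates = [math.exp(x) for x in yearly_log_rates]
    return 1.0 - math.exp(-sum(rates))


# Hypothetical woman with a constant incidence rate of 0.003 per year over 10 years.
risk_10y = t_year_risk([math.log(0.003)] * 10)
```

Passing only the first $z_i$ yearly rates gives the $z_i$-year risk $e_i(z_i)$ needed by M1 and M2, provided $z_i$ is expressed in whole years.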

For the whole sample (n = 91,968), the rate of unknown $t_0$-status individuals was 12%
($n_{ks}$ = 80,883), and the Kaplan-Meier estimate of the unconditional risk of disease was
3.21%. For Postmeno. Group 1 (n = 36,603), the rate of unknown $t_0$-status individuals was
12.5% ($n_{ks}$ = 32,027), and the Kaplan-Meier estimate of the unconditional risk of disease
was 3.39%. For Postmeno. Group 2 (n = 82,402), the rate of unknown $t_0$-status individuals
was 51.5% ($n_{ks}$ = 39,931), and the Kaplan-Meier estimate of the unconditional risk of disease
was 3.60%. The results presented in Table IV were consistent with our previous explanations.
The estimates $\mathcal{R}_{2,n}$ and $\mathcal{R}_{3,n}$ gave similar results, whereas $\mathcal{R}_{1,n}$ and especially $\mathcal{R}_{0,n}$ were
somewhat different and conceivably biased. Moreover, $\mathcal{R}_{3,n}$ was more precise than $\mathcal{R}_{2,n}$. Note
that the bias magnitude of $\mathcal{R}_{0,n}$ was of the same order as that expected in view of the results
of the simulation study. In fact, for the whole sample and Postmeno. Group 1, the rate of
unknown $t_0$-status individuals was about 12%. In our simulation study, we observed that
$C_0(t_0) \simeq 1.055$ for 10% of unknown $t_0$-status individuals. Here, we had $C_0(t_0) = 1.065$ and
$C_0(t_0) = 1.074$ on the whole sample and Postmeno. Group 1 respectively (on Postmeno. Group
2, we had $C_0(t_0) = 1.43$, but this figure could not be compared with our simulated results,
since the rate of unknown $t_0$-status individuals reached 51.5% for this group). However, the
bias magnitude of $\mathcal{R}_{1,n}$ was slightly smaller than could have been expected from
our simulation study. Indeed, the prevalence of the disease was about 1/300 per year (around
1/30 over 10 years), and we calculated $C_1(t_0) = 1.01$, while 1.02 was expected. This highlights
the fact that the distribution of $Y$ plays an important role with respect to the bias of $\mathcal{R}_{1,n}$.
In fact, this bias is larger for a uniform distribution than, for instance, for an exponential one,
where cases are likely to occur later (and in which case the terms $e_i(z_i)$ are likely to be closer
to $e_i(t_0)$).

Note that the RCM appeared to slightly underestimate the breast cancer risk in the whole
E3N population. This underestimation was more pronounced for the postmenopausal groups, especially
Postmeno. Group 2. The main reason might be that the RCM does not take hormone replacement
therapy (HRT) use into account. HRT is known to increase the risk of cancer [5], [6]. Moreover,
the use of HRT has become more and more frequent in the E3N population as well as in the general
population: this means that, overall, the use of HRT is more frequent in Postmeno. Group
2 than in Postmeno. Group 1. Therefore, this could explain (at least partly) the greater
underestimation observed on Postmeno. Group 2.



5. DISCUSSION 

We have presented and compared four methods aimed at evaluating the calibration of disease
risk prediction tools. It was shown that the estimates $\mathcal{R}_{3,n}$ and $\mathcal{R}_{2,n}$ should be preferred to
$\mathcal{R}_{0,n}$ and $\mathcal{R}_{1,n}$, the latter two being biased in most situations. The estimator $\mathcal{R}_{3,n}$ appeared
to be more precise than $\mathcal{R}_{2,n}$ on simulated data. In addition, the unbiasedness of $\mathcal{R}_{2,n}$ was not
theoretically established here, and its applicability was shown to be limited (in particular, it
should not be used when calibration has to be adjusted for percentiles of predicted risks).

Some other more sophisticated criteria (such as the Hosmer and Lemeshow [11] goodness-of-fit
statistics) may also be retained to evaluate calibration. Here, we focused on the so-called
E/O ratio, but the problems arising in this simple case of course still arise when more
sophisticated criteria are used, and we recommend the use of the Kaplan-Meier estimate to
estimate the "O" terms involved in the Hosmer-Lemeshow statistic. If this is done, however, we
also recommend either checking the distribution of the resulting statistic or using bootstrap
techniques to derive the associated p-value.

The problem of individuals who dropped out before $t_0$ years of follow-up also arises when
evaluating the discrimination of a $t_0$-year risk score. It has been shown that the concordance
statistic is biased when estimated only on the known $t_0$-status group, and an unbiased estimate
has been proposed when the underlying model is a Cox proportional hazards model with time
under study as the time scale pO]. In other cases, no unbiased estimates have been
proposed. An alternative approach is to compute the Observed Relative Risk (ORR). To do this,
individuals have to be sorted by predicted $t_0$-year risk. Then, the ORR is simply the ratio of
the number of observed cases in the top decile (or quintile) of predicted $t_0$-year risks to the
number of observed cases in the bottom decile (or quintile). Obviously, since observed numbers
of cases are generally not at the statistician's disposal, Kaplan-Meier estimates (and bootstrap
confidence intervals) are required in this setting too.

In conclusion, we strongly recommend the use of $\mathcal{R}_{3,n}$ as an estimate of $\mathcal{E}/\mathcal{O}$, even if it
is not the most commonly used estimate in the literature on the evaluation of calibration (especially
in the breast cancer field).



6. APPENDIX 

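6.1. Proof of (11)

Our aim is first to prove (11), which is recalled in (15) below for convenience:

\[ \mathcal{R}_{3,n} \simeq \frac{F_{n,ks}(t_0)}{K_n(t_0)}\, \mathcal{R}_{0,n}. \tag{15} \]

First note that

\[ \frac{E_{\mathbb{S}_n}}{O_{\mathbb{S}_n}} = \frac{E_{\mathbb{S}_{n,ks}} + E_{\mathbb{S}_{n,uks}}}{O_{\mathbb{S}_{n,ks}} + O_{\mathbb{S}_{n,uks}}}, \]

where $O_{\mathbb{S}_{n,uks}}$ (resp. $E_{\mathbb{S}_{n,uks}}$) is the observed (resp. expected) number of cases on $\mathbb{S}_{n,uks}$.
Keeping in mind that $O_{\mathbb{S}_{n,ks}} = n_{ks} F_{n,ks}(t_0)$, $\hat{O}_{\mathbb{S}_n} = n K_n(t_0)$ and $n = n_{ks} + n_{uks}$, it is
straightforward that

\[ \frac{E_{\mathbb{S}_n}}{\hat{O}_{\mathbb{S}_n}} = \frac{E_{\mathbb{S}_{n,ks}}}{O_{\mathbb{S}_{n,ks}}} \times \frac{1 + E_{\mathbb{S}_{n,uks}}/E_{\mathbb{S}_{n,ks}}}{1 + n_{uks}/n_{ks}} \times \frac{F_{n,ks}(t_0)}{K_n(t_0)}. \]

Next, introduce the following assumption:

(H) The censoring process is independent of the covariates.

Remark 1. The condition (H) ensures that the distribution of the covariates is the same on
$\mathbb{S}_{n,uks}$ and $\mathbb{S}_{n,ks}$ (and then on $\mathbb{S}_n$).

Under (H), with $\bar{e}_n = E_{\mathbb{S}_n}/n$,

\[ \frac{E_{\mathbb{S}_{n,uks}}}{E_{\mathbb{S}_{n,ks}}} \simeq \frac{n_{uks}\, \bar{e}_n}{n_{ks}\, \bar{e}_n} = \frac{n_{uks}}{n_{ks}}, \]

in such a way that

\[ \mathcal{R}_{3,n} = \frac{E_{\mathbb{S}_n}}{\hat{O}_{\mathbb{S}_n}} \simeq \frac{F_{n,ks}(t_0)}{K_n(t_0)} \times \frac{E_{\mathbb{S}_{n,ks}}}{O_{\mathbb{S}_{n,ks}}} = \frac{F_{n,ks}(t_0)}{K_n(t_0)}\, \mathcal{R}_{0,n}. \tag{16} \]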
Table I. Results of the simulation studies. For each of the four methods in each of the 12 simulation
designs, the three entries are, in order, the mean of the estimate of the E/O ratio (mean), the mean
width of the corresponding confidence interval (width) and the proportion of these intervals including
the value 1 (cover.). Means were obtained from 1,000 independent samples.

Rate of  Observed  Method M0             Method M1             Method M2             Method M3
UKSI†    cases     mean   width  cover.  mean   width  cover.  mean   width  cover.  mean   width  cover.

Case 1: λ = 100
0%       2,000     1.000  0.088  0.967   0.950  0.083  0.374   1.000  0.088  0.967   1.000  0.083  0.957
5%       1,947     0.976  0.087  0.809   0.951  0.085  0.406   1.001  0.089  0.955   1.001  0.084  0.946
10%      1,895     0.950  0.086  0.391   0.951  0.086  0.410   1.002  0.090  0.967   1.001  0.086  0.961
20%      1,787     0.893  0.083  0.003   0.953  0.088  0.462   1.004  0.093  0.963   1.001  0.088  0.951

Case 2: λ = 200
0%       1,001     1.000  0.124  0.954   0.975  0.121  0.876   1.000  0.124  0.954   1.000  0.121  0.949
5%       973       0.976  0.123  0.891   0.978  0.123  0.891   1.003  0.126  0.960   1.002  0.123  0.953
10%      948       0.949  0.121  0.620   0.976  0.124  0.898   1.002  0.128  0.958   1.001  0.124  0.956
20%      896       0.890  0.117  0.066   0.977  0.128  0.887   1.003  0.132  0.959   1.001  0.128  0.955

Case 3: λ = 400
0%       500       1.002  0.176  0.950   0.990  0.174  0.931   1.002  0.176  0.950   1.002  0.174  0.950
5%       488       0.976  0.174  0.907   0.989  0.176  0.939   1.002  0.178  0.964   1.002  0.181  0.960
10%      475       0.948  0.171  0.783   0.989  0.178  0.942   1.001  0.181  0.968   1.001  0.178  0.964
20%      448       0.893  0.166  0.313   0.992  0.184  0.965   1.005  0.187  0.968   1.004  0.185  0.966

† Unknown $t_0$-status individuals.
Table II. Results of the simulation studies showing the mean of the correction terms. Means were
obtained from 1,000 independent samples.

Rate of  Correction term  Correction term
UKSI†    C0(t0)‡          C1(t0)*

Case 1: λ = 100
0%       1.000            1.053
5%       1.025            1.053
10%      1.053            1.054
20%      1.120            1.055

Case 2: λ = 200
0%       1.000            1.026
5%       1.026            1.026
10%      1.055            1.026
20%      1.124            1.027

Case 3: λ = 400
0%       1.000            1.013
5%       1.026            1.013
10%      1.055            1.013
20%      1.125            1.013

† Unknown $t_0$-status individuals.
‡ $C_0(t_0) = F_{n,ks}(t_0)/K_n(t_0)$.
* $C_1(t_0) = \mathcal{R}_{2,n}/\mathcal{R}_{1,n}$.
Table III. Coefficients of the first Rosner and Colditz model (RCM).

Parameter                                                            Regression coefficient  SE*

α (intercept)                                                        -9.687                  0.265
β0 (age at menarche)                                                 0.048                   0.016
β1 (min[age, age at menopause] - age at menarche)                    O.OSf                   0.004
β2 (age - age at menopause), for menopausal women                    0.050                   0.005
β3 (age at first birth - age at menarche)                            0.0f3                   0.004
β4 (birth index)                                                     -0.0036                 0.0009
β5 (birth index x [age - age at menopause]), for menopausal women    -0.00020                0.00012

* SE: standard error.



Table IV. Evaluation of the calibration of the Rosner and Colditz 10-year risk of breast cancer
prediction tool. Results from the E3N cohort.

Population          Rate of  Observed  Method M0      Method M1      Method M2      Method M3
for validation      UKSI†    cases     [CI*]          [CI*]          [CI*]          [CI*]

Whole sample        12.1%    2,765     0.889          0.932          0.940          0.947
                                       [0.839-0.941]  [0.880-0.987]  [0.887-0.996]  [0.912-0.982]

Postmeno. Group 1‡  12.5%    1,160     0.635          0.672          0.678          0.682
                                       [0.600-0.673]  [0.634-0.711]  [0.640-0.718]  [0.644-0.721]

Postmeno. Group 2§  51.5%    2,115     0.417          0.591          0.597          0.595
                                       [0.394-0.442]  [0.558-0.626]  [0.564-0.633]  [0.569-0.620]

* CI: confidence intervals.
† Unknown $t_0$-status individuals.
‡ Postmenopausal women at inclusion.
§ Postmenopausal women during follow-up.